NSF PAR Search | NSF Public Access Repository

SketchAgent: Language-Driven Sequential Sketch Generation

Vinker, Yael; Shaham, Tamar Rott; Zheng, Kristine; Zhao, Alex; Fan, Judith E; Torralba, Antonio (June 2025, Computer Vision Foundation)

Sketching serves as a versatile tool for externalizing ideas, enabling rapid exploration and visual communication that spans various disciplines. While artificial systems have driven substantial advances in content creation and human-computer interaction, capturing the dynamic and abstract nature of human sketching remains challenging. In this work, we introduce SketchAgent, a language-driven, sequential sketch generation method that enables users to create, modify, and refine sketches through dynamic, conversational interactions. Our approach requires no training or fine-tuning. Instead, we leverage the sequential nature and rich prior knowledge of off-the-shelf multimodal large language models (LLMs). We present an intuitive sketching language, introduced to the model through in-context examples, enabling it to “draw” using string-based actions. These are processed into vector graphics and then rendered to create a sketch on a pixel canvas, which can be accessed again for further tasks. By drawing stroke by stroke, our agent captures the evolving, dynamic qualities intrinsic to sketching. We demonstrate that SketchAgent can generate sketches from diverse prompts, engage in dialogue-driven drawing, and collaborate meaningfully with human users.
Free, publicly-accessible full text available June 15, 2026
Open Vocabulary Semantic Scene Sketch Understanding

Bourouis, Ahmed; Fan, Judith E; Gryaditskaya, Yulia (June 2024, Computer Vision Foundation)

We study the underexplored but fundamental vision problem of machine understanding of abstract freehand scene sketches. We introduce a sketch encoder that yields a semantically aware feature space, which we evaluate on a semantic sketch segmentation task. To train our model, we rely only on the availability of bitmap sketches with their brief captions and do not require any pixel-level annotations. To obtain generalization to a large set of sketches and categories, we build on a vision transformer encoder pretrained with the CLIP model. We freeze the text encoder and perform visual-prompt tuning of the visual encoder branch while introducing a set of critical modifications. First, we augment the classical key-query (k-q) self-attention blocks with value-value (v-v) self-attention blocks. Central to our model is a two-level hierarchical network design that enables efficient semantic disentanglement: the first level ensures holistic scene sketch encoding, and the second level focuses on individual categories. In the second level of the hierarchy, we then introduce a cross-attention between the textual and visual branches. Our method outperforms zero-shot CLIP pixel accuracy of segmentation results by 37 points, reaching an accuracy of 85.5% on the FS-COCO sketch dataset. Finally, we conduct a user study that allows us to identify further improvements needed for our method to reconcile machine and human understanding of scene sketches.
Full Text Available
Creating ad hoc graphical representations of number

https://doi.org/10.1016/j.cognition.2023.105665

Holt, Sebastian; Fan, Judith E.; Barner, David (January 2024, Cognition)

The ability to communicate about exact number is critical to many modern human practices spanning science, industry, and politics. Although some early numeral systems used 1-to-1 correspondence (e.g., ‘IIII’ to represent 4), most systems provide compact representations via more arbitrary conventions (e.g., ‘7’ and ‘VII’). When people are unable to rely on conventional numerals, however, what strategies do they initially use to communicate number? Across three experiments, participants used pictures to communicate about visual arrays of objects containing 1–16 items, either by producing freehand drawings or combining sets of visual tokens. We analyzed how the pictures they produced varied as a function of communicative need (Experiment 1), spatial regularities in the arrays (Experiment 2), and visual properties of tokens (Experiment 3). In Experiment 1, we found that participants often expressed number in the form of 1-to-1 representations, but sometimes also exploited the configuration of sets. In Experiment 2, this strategy of using configural cues was exaggerated when sets were especially large, and when the cues were predictably correlated with number. Finally, in Experiment 3, participants readily adopted salient numerical features of objects (e.g., four-leaf clover) and generally combined them in a cumulative-additive manner. Taken together, these findings corroborate historical evidence that humans exploit correlates of number in the external environment – such as shape, configural cues, or 1-to-1 correspondence – as the basis for innovating more abstract number representations.
Full Text Available
Parallel developmental changes in children’s production and recognition of line drawings of visual concepts

https://doi.org/10.1038/s41467-023-44529-9

Long, Bria; Fan, Judith E; Huey, Holly; Chai, Zixian; Frank, Michael C (February 2024, Nature Communications)

Childhood is marked by the rapid accumulation of knowledge and the prolific production of drawings. We conducted a systematic study of how children create and recognize line drawings of visual concepts. We recruited 2-10-year-olds to draw 48 categories via a kiosk at a children’s museum, resulting in >37K drawings. We analyze changes in the category-diagnostic information in these drawings using vision algorithms and annotations of object parts. We find developmental gains in children’s inclusion of category-diagnostic information that are not reducible to variation in visuomotor control or effort. Moreover, even unrecognizable drawings contain information about the animacy and size of the category children tried to draw. Using guessing games at the same kiosk, we find that children improve across childhood at recognizing each other’s line drawings. This work leverages vision algorithms to characterize developmental changes in children’s drawings and suggests that these changes reflect refinements in children’s internal representations.
Full Text Available
How do communicative goals guide which data visualizations people think are effective?

Huey, Holly; Oey, Lauren; Lloyd, Hannah; Fan, Judith E (September 2023, Cognitive Science Society)

Data visualizations are powerful tools for communicating quantitative information. While prior work has focused on how experts design informative graphs, little is known about the intuitions non-experts have about what makes a graph effective for communicating a specific message. In the current study, we asked participants (N=398) which of eight graphs would be most useful for answering a particular question, where all graphs were generated from the same dataset but varied in how the data were arranged. We tested the degree to which participants based their decisions on sensitivity to how easily other participants (N=542) would be able to answer that question with that graph. Our results suggest that while people were biased towards graphs that were at least minimally informative (i.e., contained the relevant variables), their decisions did not necessarily reflect sensitivity to more graded but systematic variation in actual graph comprehensibility.
Developmental changes in drawing production under different memory demands in a U.S. and Chinese sample.

https://doi.org/10.1037/dev0001600

Long, Bria; Wang, Ying; Christie, Stella; Frank, Michael C; Fan, Judith E (October 2023, Developmental Psychology)

Full Text Available
Drawing as a versatile cognitive tool

https://doi.org/10.1038/s44159-023-00212-w

Fan, Judith E; Bainbridge, Wilma A; Chamberlain, Rebecca; Wammes, Jeffrey D (September 2023, Nature Reviews Psychology)
Schubert, Teresa (Ed.)
Drawing is a cognitive tool that makes the invisible contents of mental life visible. Humans use this tool to produce a remarkable variety of pictures, from realistic portraits to schematic diagrams. Despite this variety and the prevalence of drawn images, the psychological mechanisms that enable drawings to be so versatile have yet to be fully explored. In this Review, we synthesize contemporary work in multiple areas of psychology, computer science and neuroscience that examines the cognitive processes involved in drawing production and comprehension. This body of findings suggests that the balance of contributions from perception, memory and social inference during drawing production varies depending on the situation, resulting in some drawings that are more realistic and other drawings that are more abstract. We also consider the use of drawings as a research tool for investigating various aspects of cognition, as well as the role that drawing has in facilitating learning and communication. Taken together, information about how drawings are used in different contexts illuminates the central role of visually grounded abstractions in human thought and behaviour.
Full Text Available
Visual explanations prioritize functional properties at the expense of visual fidelity

https://doi.org/10.1016/j.cognition.2023.105414

Huey, Holly; Lu, Xuanchen; Walker, Caren M.; Fan, Judith E. (July 2023, Cognition)

Full Text Available
Visual resemblance and interaction history jointly constrain pictorial meaning

https://doi.org/10.1038/s41467-023-37737-w

Hawkins, Robert D.; Sano, Megumi; Goodman, Noah D.; Fan, Judith E. (April 2023, Nature Communications)

How do drawings—ranging from detailed illustrations to schematic diagrams—reliably convey meaning? Do viewers understand drawings based on how strongly they resemble an entity (i.e., as images) or based on socially mediated conventions (i.e., as symbols)? Here we evaluate a cognitive account of pictorial meaning in which visual and social information jointly support visual communication. Pairs of participants used drawings to repeatedly communicate the identity of a target object among multiple distractor objects. We manipulated social cues across three experiments and a full replication, finding that participants developed object-specific and interaction-specific strategies for communicating more efficiently over time, beyond what task practice or a resemblance-based account alone could explain. Leveraging model-based image analyses and crowdsourced annotations, we further determined that drawings did not drift toward “arbitrariness,” as predicted by a pure convention-based account, but preserved visually diagnostic features. Taken together, these findings advance psychological theories of how successful graphical conventions emerge.
Learning to communicate about shared procedural abstractions

McCarthy, William P; Hawkins, Robert D; Wang, Haoliang; Holdaway, Cameron; Fan, Judith E (January 2021, Proceedings of the Annual Conference of the Cognitive Science Society)
Many real-world tasks require agents to coordinate their behavior to achieve shared goals. Successful collaboration requires not only adopting the same communicative conventions, but also grounding these conventions in the same task-appropriate conceptual abstractions. We investigate how humans use natural language to collaboratively solve physical assembly problems more effectively over time. Human participants were paired up in an online environment to reconstruct scenes containing two block towers. One participant could see the target towers, and sent assembly instructions for the other participant to reconstruct. Participants provided increasingly concise instructions across repeated attempts on each pair of towers, using more abstract referring expressions that captured each scene's hierarchical structure. To explain these findings, we extend recent probabilistic models of ad hoc convention formation with an explicit perceptual learning mechanism. These results shed light on the inductive biases that enable intelligent agents to coordinate upon shared procedural abstractions.
Full Text Available
